Code
library(tidyverse)
library(broom)
library(ggridges)
library(RColorBrewer)
library(knitr)
library(DT)
library(kableExtra)
library(gganimate)
library(gifski)
library(patchwork)Investigating the Relationship Between Income and Sanitation
library(tidyverse)
library(broom)
library(ggridges)
library(RColorBrewer)
library(knitr)
library(DT)
library(kableExtra)
library(gganimate)
library(gifski)
library(patchwork)sanitation.data <- read_csv("at_least_basic_sanitation_overall_access_percent.csv")
income.data <- read_csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
colnames(income.data) <- c("country", as.numeric(1799:2049))We are interested in the relationship between income and sanitation levels. One might assume that as income per person increases, so too will the access to basic necessities, including hygienic products and facilities, such as bathrooms and showers. We aim to analyze data relating to these factors, outlined in the process below.
income.data |>
slice_head(n = 1000) |>
select(country, `1999`:`2002`) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 1: Preview of Income Data Set, 1999-2002.",
filter = "top")Our income data set measures the total GDP of a country per person for 216 countries between the years of 1892 and 2049, with units in fixed 2017 prices. This number is adjusted for PPP, or the differences between costs of living and essential products, across countries. The data comes from the World Bank, the Maddison Project Database, and the Penn World Table. Historical estimates were used for early years; forecasts from the IMF’s Economic Outlook were used to project income in the future.
Source: https://www.gapminder.org/data/documentation/gd001/
sanitation.data |>
slice_head(n = 1000) |>
select(1:5) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 2: Preview of Sanitation Data Set, 1999-2002.",
filter = "top")Our sanitation data set measures the percentage of people (living in both urban and rural settings) who use at least basic sanitation services not shared with other households. This includes flushing and pouring to piped sewer systems, septic, and ventilation for pit latrines, such as squat toilets.
The data was collected by the World Health Organization and UNICEF, falling between the years of 1999 to 2019.
Source: https://data.worldbank.org/indicator/SH.STA.SMSS.ZS
We hypothesize that there will be a positive relation between income per person and percentage of people with basic sanitation at their disposal. We would like to be able to predict a country’s percentage of people who have access to basic sanitation services based on that country’s income per person.
We continue our analysis by merging these two data sets to focus on this hypothesized relationship between 1999 and 2019.
The cleaning process requires removing instances of “k” from some data entries, representing thousands of dollars. For example, we had to perform the following conversion: \(12.3k \to 12300\) for such entries. Notice that the decision to drop NA values was made, since ggplot and lm will drop NA values and we have an abundance of data entries. This decision will be more convenient later down the road when taking advantage of the country column.
convert_num <- function(x){ #converts "12.3k" to 12300,
temp <- x
if(str_detect(x, "k$")){
temp <- x |>
str_replace_all("[^[:digit:]]", "") |> #gross but it just replaces any non-number with ""
as.numeric() * 100
}
return(temp)
}
#data cleaning + joining
sanitation.clean <- sanitation.data |>
drop_na() |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "sanitation")
income.clean <- income.data |>
drop_na() |> #pretty bold step to remove any rows with na vals,
#but due to the abundance of data seems reasonable
select(c(country, `1999`:`2019`)) |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "income") |>
mutate(income = as.numeric(map(income, convert_num)))#converts characters,
#able to convert after pivot, b/c data is all seen as strings!Below is the final, merged, and cleaned data set.
full.data <- inner_join(sanitation.clean, income.clean) |>
mutate(year = as.numeric(year))
full.data |>
slice_head(n = 1000) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 3: Preview of Final Data Set.",
filter = "top")We aim to piece together the relationship between income and sanitation levels with a linear model. Our goal is to develop a model that predicts the percent of people with basic sanitation (response) from adjusted income per person (explanatory).
We first visualize the relation between these two variables.
data.plot <- full.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Percentage of People with (at least) Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
data.plotAlthough the above scatter plot is very messy, we can see a general pattern: countries with low adjusted income per person have lower percentages of people with basic sanitation, whereas countries with high income per person rarely have less than 90 percent of people with basic sanitation.
us.plot <- full.data |>
mutate(year = as.numeric(year)) |>
filter(country == "United States") |>
ggplot(mapping = aes(x = income, y = sanitation)) +
geom_point() +
transition_time(year) +
labs(title = "United States Income vs. Sanitation, 1999-2019",
subtitle = "Year: {floor(frame_time)}",
x = "Adjusted Income Per Person (in 2017 $)",
y = "Percentage of People with (at least) Basic Santitation") +
shadow_mark(alpha = 1,
size = 1)
w = animate(us.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_US_plot.gif", animation = w)
cuba.plot <- full.data |>
mutate(year = as.numeric(year)) |>
filter(country == "Cuba") |>
ggplot(mapping = aes(x = income, y = sanitation)) +
geom_point() +
transition_time(year) +
labs(title = "Cuba Income vs. Sanitation, 1999-2019",
subtitle = "Year: {floor(frame_time)}",
x = "Adjusted Income Per Person (in 2017 $)",
y = "Percentage of People with (at least) Basic Santitation") +
shadow_mark(alpha = 1,
size = 1)
z = animate(cuba.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_Cuba_plot.gif", animation = z)However, there are instances where countries with very low income per person have high percentages of people with basic sanitation. Above, you can see that Cuba has a relatively low adjusted income per person in the years between 1999 and 2019, but still enjoyed a sanitation percentage above 88 percent.
On the other hand you can see that despite adjusted income per person dipping below $20,000, the United States’ sanitation percentage never drops below 99 percent. Despite an obvious general trend, further investigation is needed to reach our goals.
Our next step is to look at how relationship between our variables of interest changes over time.
temp.plot <- full.data |>
mutate(year = as.numeric(year)) |>
ggplot(mapping = aes(x = income,
y = sanitation)) +
geom_point(show.legend = F) +
transition_time(year) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Year: {floor(frame_time)}",
x = "Adjusted Income per Person (in 2017 $)",
y = "Percentage of People with (at least) Basic Sanitation Services") +
shadow_mark(keep_layer = T, size = 0.1, alpha = 0.5) +
enter_fade() +
exit_fade()
anim_time = animate(temp.plot, duration = 10, fps = 10, renderer = gifski_renderer())
anim_save("animated_time_plot.gif", animation = anim_time)It looks like almost every data point experiences increases in percentage of people with basic sanitation services coupled with overall increases in adjusted income per person. This gives us more confidence that our hypothesized relationship between our variables is correct, but more in depth analysis is needed before we can conclude anything.
Upon fitting a simple linear regression model for percent of people who have access to basic sanitation (\(Y\)) based on adjusted income per person (\(X\)), we acquired the following regression equation:
\[\hat{Y} = 55.75 + 0.0009699X\]
data.plot + geom_smooth(method = "lm", color = "red")Our regression equation tells us that a country with an average adjusted income of $0 will have 55.75 percent of its people with access to basic sanitation services. Note that this conclusion is an extrapolation since our data does not contain any information about any countries with a $0 adjusted income per person. Our equation also tells us for every ten thousand dollar increase in adjusted income per person, a country should expect a 9.699 percent increase in the amount of people who have access to basic sanitation.
To assess the validity of our linear model, we look at the variances in observed sanitation percentage, predicted sanitation percentage, and residuals.
model_var <- augment(data.fit) |>
summarise(var.resp = var(sanitation),
var.fitted = var(.fitted),
var.resid =var(.resid))
model_var |>
kable(caption = "Variance in Observed/Predicted/Residual Sanitation.",
col.names = c("Variance in Sanitation", "Variance in Predicted", "Variance in Residuals"),
align = c('l', 'l', 'l') )| Variance in Sanitation | Variance in Predicted | Variance in Residuals |
|---|---|---|
| 942.5548 | 313.8511 | 628.7037 |
#need some writing about this table and to make it look niceThe proportion of variability accounted for by our model was 0.333. Our model does a good enough job to communicate the general idea that the more money people are making, the make likely it is that they will have access to basic sanitation services. However, the variable we are hoping to predict is a percentage which means it can not exceed 100%, exposing the flaw that our model will predict impossible to reach sanitation percentages for high enough average income inputs. This gives us problems regarding what we can and can not conclude due to risk of extrapolating further than our data provides. There is not much we can do about this using a simple linear model as we would have to introduce a polynomial regression equation (say, through a logarithmic transformation) to be more precise.
data_predict <- predict(data.fit) #predicted values
data_sig <- sigma(data.fit) #s or residual std error
noise <- function(x, mean = 0, sd){ #stolen from Prof Robinson :,)
x + rnorm(length(x),
mean,
sd)
}
sim_response <- tibble(sanitation = noise(data_predict,
sd = data_sig)
)#what is this?
full.dist <- full.data |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Observed Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
sim.dist <- sim_response |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Simulated Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
full.dist + sim.distsim.plot <- sim_response |>
ggplot(mapping = aes(x = full.data$income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Predicted Sanitation versus Income, 1999-2019",
subtitle = "Percentage of People with (at least) Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
sim.plot + data.plot#need to make same plot as data vis in beginning. idk how to get income values so i just put the originalsSimilarities and differences in the predicted data vs observed data.
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
.f = ~ tibble(sim = noise(data_predict,
sd = data_sig)
)
)
colnames(sims) <- colnames(sims) |>
str_replace(pattern = "\\.\\.\\.",
replace = "_")
temp <- full.data |>
select(sanitation) |>
bind_cols(sims)
sim_r_sq <- temp |>
map(~ lm(sanitation ~ .x, data = temp)) |>
map(glance) |>
map_dbl(~ .x$r.squared)#removes first col
tibble(sims = sim_r_sq[-1]) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.0025) +
labs(titel = "Distribution of $R^2$",
x = expression("Simulated"~ R^2),
y = "",
subtitle = "Number of Simulated Models")discuss implications of above plot how well does our model generate data similar to observed
Log Transformation tests below, beware as var names below overwrite those above.
temp.data <- full.data
temp.data$income <- log(temp.data$income, base = 2)
temp.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point() +
geom_smooth(method = lm, level = 0, color = "red") +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Income versus Sanitation, 1999-2049",
subtitle = "Percentage of People with (at least) Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")temp.fit <- lm(sanitation~income, data = temp.data)
summary(temp.fit)
Call:
lm(formula = sanitation ~ income, data = temp.data)
Residuals:
Min 1Q Median 3Q Max
-50.590 -12.503 -0.889 11.002 66.830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -102.2536 2.5852 -39.55 <2e-16 ***
income 13.3280 0.1969 67.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.25 on 3526 degrees of freedom
Multiple R-squared: 0.5651, Adjusted R-squared: 0.565
F-statistic: 4582 on 1 and 3526 DF, p-value: < 2.2e-16
data_predict <- predict(temp.fit) #predicted values
data_sig <- sigma(temp.fit) #s or residual std error
noise <- function(x, mean = 0, sd){ #stolen from Prof Robinson :,)
x + rnorm(length(x),
mean,
sd)
}
sim_response <- tibble(sanitation = noise(data_predict,
sd = data_sig)
)
head(sim_response)# A tibble: 6 × 1
sanitation
<dbl>
1 23.0
2 -10.1
3 36.6
4 24.0
5 27.9
6 29.4
temp.dist <- temp.data |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Observed Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
sim.dist <- sim_response |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Simulated Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
temp.dist + sim.distnsims <- 1000
sims <- map_dfc(.x = 1:nsims,
.f = ~ tibble(sim = noise(data_predict,
sd = data_sig)
)
)
colnames(sims) <- colnames(sims) |>
str_replace(pattern = "\\.\\.\\.",
replace = "_")
head(sims)# A tibble: 6 × 1,000
sim_1 sim_2 sim_3 sim_4 sim_5 sim_6 sim_7 sim_8 sim_9 sim_10 sim_11 sim_12
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 35.3 21.8 28.0 20.4 8.95 -3.30 28.6 33.6 -0.496 -14.1 8.67 53.1
2 24.9 -3.03 18.7 30.4 40.2 29.1 28.2 3.97 26.6 68.6 22.0 21.9
3 7.72 14.3 26.2 47.2 12.5 14.7 66.6 32.0 -10.2 47.6 18.7 12.5
4 52.8 37.7 19.9 38.2 20.3 31.3 16.1 37.7 36.7 61.2 -0.794 35.4
5 23.0 5.93 21.7 45.9 21.1 32.3 41.9 44.8 58.4 29.3 78.0 28.5
6 37.6 23.7 28.5 24.8 64.1 38.0 33.6 18.2 46.7 16.3 17.7 47.1
# … with 988 more variables: sim_13 <dbl>, sim_14 <dbl>, sim_15 <dbl>,
# sim_16 <dbl>, sim_17 <dbl>, sim_18 <dbl>, sim_19 <dbl>, sim_20 <dbl>,
# sim_21 <dbl>, sim_22 <dbl>, sim_23 <dbl>, sim_24 <dbl>, sim_25 <dbl>,
# sim_26 <dbl>, sim_27 <dbl>, sim_28 <dbl>, sim_29 <dbl>, sim_30 <dbl>,
# sim_31 <dbl>, sim_32 <dbl>, sim_33 <dbl>, sim_34 <dbl>, sim_35 <dbl>,
# sim_36 <dbl>, sim_37 <dbl>, sim_38 <dbl>, sim_39 <dbl>, sim_40 <dbl>,
# sim_41 <dbl>, sim_42 <dbl>, sim_43 <dbl>, sim_44 <dbl>, sim_45 <dbl>, …
temp <- temp.data |>
select(sanitation) |>
bind_cols(sims)
sim_r_sq <- temp |>
map(~ lm(sanitation ~ .x, data = temp)) |>
map(glance) |>
map_dbl(~ .x$r.squared)#removes first col
tibble(sims = sim_r_sq[-1]) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.0025) +
labs(x = expression("Simulated"~ R^2),
y = "",
subtitle = "Number of Simulated Models")